This PR rebases the Apple M5 Max 4096-prefill correctness/scheduling work onto current upstream main (613e9b2 at benchmark/rebase time). Non-M5 devices keep the existing 2048-token default.
Fresh paired comparison against current antirez/main on an Apple M5 Max 128GB machine, Metal backend, ds4flash.gguf:
| benchmark | current main | this PR | result |
| --- | --- | --- | --- |
| 4096-step sweep, avg prefill | 284.42 t/s | 291.56 t/s | +2.51% |
| 4096-step sweep, avg generation | 28.72 t/s | 28.48 t/s | neutral / -0.86% |
| README 65k sweep, avg prefill | 256.99 t/s | 250.38 t/s | neutral / -2.57% |
| README 65k sweep, avg generation | 27.73 t/s | 27.56 t/s | neutral / -0.61% |
The safe claim for this PR is: it enables and correctness-gates the M5 Max 4096-token prefill path, with a small 4096-sweep prefill win in this fresh run and otherwise neutral throughput. The larger decode win is kept separate in follow-up #169.
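For readers who want to re-derive the result column, the relative changes can be recomputed from the rounded averages in the table. This is a minimal sketch; the helper name `pct_change` and the row labels are illustrative and not part of ds4-bench.

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (current main t/s, this PR t/s), copied from the table above.
rows = {
    "4096-step sweep, avg prefill": (284.42, 291.56),
    "4096-step sweep, avg generation": (28.72, 28.48),
    "README 65k sweep, avg prefill": (256.99, 250.38),
    "README 65k sweep, avg generation": (27.73, 27.56),
}

for name, (main_tps, pr_tps) in rows.items():
    print(f"{name}: {pct_change(main_tps, pr_tps):+.2f}%")
```

Three of the four rows reproduce the reported percentages exactly. The 4096-step generation row comes out at about -0.84% from the rounded averages rather than the reported -0.86%, presumably because the reported figure was computed from unrounded values.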
What changed
- Adds Apple M5-gated Metal runtime fast paths:
  - simdgroup matrix matmul specialization
  - private Metal scratch buffers for GPU-only hot intermediates, keeping hazard tracking enabled
- Makes 4096-token prefill chunks the default only on Apple M5 Max Metal.
- Keeps other devices/backends on the existing 2048-token default.
- Makes the 4096-token path correctness-safe by splitting the zero-prefix first chunk at the existing 2048-token correctness boundary. This avoids selecting compressed top-k rows from future causal positions.
- Aligns server KV disk-cache boundaries to the backend prefill chunk:
  - M5 Max Metal: 4096
  - other devices/backends: 2048
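One plausible reading of the first-chunk split is sketched below in Python. It is an illustration only, not the PR's C/Metal code. The function name `plan_prefill_chunks`, its parameters, and the choice to end every chunk on a multiple of the chunk size are assumptions made for this example.

```python
def plan_prefill_chunks(n_tokens, chunk=4096, boundary=2048, prefix_len=0):
    """Return (start, end) token spans for prefilling n_tokens new tokens.

    Assumptions (hypothetical, for illustration):
      * chunk ends are aligned to multiples of `chunk`;
      * a chunk that starts at position 0 (no cached prefix) is cut at
        `boundary`, so its first piece never spans more than 2048 tokens.
    """
    spans = []
    pos = prefix_len
    end_total = prefix_len + n_tokens
    while pos < end_total:
        # Next multiple of `chunk` strictly after `pos`, capped at the prompt end.
        end = min((pos // chunk + 1) * chunk, end_total)
        # Zero-prefix first chunk: stop at the 2048-token correctness boundary.
        if pos == 0 and end > boundary:
            end = boundary
        spans.append((pos, end))
        pos = end
    return spans


print(plan_prefill_chunks(10000))
print(plan_prefill_chunks(5000, prefix_len=4096))
```

Under these assumptions a 10000-token prompt with no prefix is processed as two 2048-token pieces followed by 4096-aligned chunks, while a request that resumes from a 4096-token prefix uses full 4096-token chunks from the start.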
DS4_METAL_PREFILL_CHUNK=2048 forces the previous M5 Max chunk size. Values above 4096 still require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1 on the M5 Max default path.
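The override rules for the M5 Max path can be modelled roughly as follows. This sketch only covers what the text states: the 4096 default, the `DS4_METAL_PREFILL_CHUNK` override, and the `DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1` gate for values above 4096. The function name is hypothetical, and rejecting an ungated large value with an error is an assumption; the real runtime may clamp or warn instead.

```python
M5_MAX_DEFAULT_CHUNK = 4096  # default on Apple M5 Max Metal, per this PR


def resolve_m5_max_prefill_chunk(env):
    """Resolve the prefill chunk size for the M5 Max path from an env mapping."""
    raw = env.get("DS4_METAL_PREFILL_CHUNK")
    if raw is None:
        return M5_MAX_DEFAULT_CHUNK

    requested = int(raw)
    if requested <= 0:
        raise ValueError("prefill chunk must be positive")

    allow_unsafe = env.get("DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK") == "1"
    if requested > M5_MAX_DEFAULT_CHUNK and not allow_unsafe:
        # Assumption: reject. The actual behaviour is not described in the text.
        raise ValueError(
            "chunks above 4096 require DS4_METAL_ALLOW_UNSAFE_PREFILL_CHUNK=1"
        )
    return requested


print(resolve_m5_max_prefill_chunk({}))
print(resolve_m5_max_prefill_chunk({"DS4_METAL_PREFILL_CHUNK": "2048"}))
```

Passing the environment as a plain mapping keeps the sketch testable; in practice `os.environ` could be passed in directly.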
Correctness
Passed locally on Apple M5 Max, 128GB RAM, Metal backend, ds4flash.gguf, after rebasing to current upstream main:
make clean && make
make test
DS4_METAL_PREFILL_CHUNK=4096 ./ds4_test --logprob-vectors
`make test` covers:
- `--long-context`
- `--tool-call-quality`
- `--logprob-vectors`
- `--metal-kernels`
- `--server`
Eval parity check
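Deterministic 12-question ds4-eval slice against current upstream main produced identical grading decisions and token counts on both branches. The same two cases failed on both branches with the same extracted answers, so this eval slice shows no quality regression.
Memory
The 4096 default increases Metal context-buffer allocation but keeps it modest for the tested M5 Max class machine.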
From ds4-bench context buffer reporting at the README 65k allocation:
| chunk | context buffers |
| --- | --- |
| 2048 | 1311.89 MiB |
| 4096 | 1740.42 MiB |
That is about +0.4 GiB. Other devices keep the old 2048 default.
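As a quick check of the "+0.4 GiB" figure, the difference can be worked out from the two reported buffer sizes. The variable names are illustrative only.

```python
MIB_PER_GIB = 1024

buf_2048_mib = 1311.89  # context buffers with 2048-token chunks
buf_4096_mib = 1740.42  # context buffers with 4096-token chunks

delta_mib = buf_4096_mib - buf_2048_mib
delta_gib = delta_mib / MIB_PER_GIB
increase_pct = delta_mib / buf_2048_mib * 100.0

print(f"delta: {delta_mib:.2f} MiB = {delta_gib:.2f} GiB ({increase_pct:.1f}% more)")
```

The difference is 428.53 MiB, or roughly 0.42 GiB, which is about a one-third increase over the 2048-chunk allocation.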
Scope notes
This PR is intentionally limited to runtime Metal changes and the M5 Max prefill default. The decode-indexer speedup is kept in #169.
fitchmultz changed the title from "Metal: add M5 Max fast paths and 4096 prefill default" to "Metal: speed up M5 Max prefill with correctness-gated 4096 chunks" on May 14, 2026.
fitchmultz changed the title from "Metal: speed up M5 Max prefill with correctness-gated 4096 chunks" to "Metal: correctness-gate M5 Max 4096 prefill (+5%)" on May 14, 2026.
Opened a stacked follow-up for the M5 Max decode-indexer tuning: #169. It depends on this PR and includes a clean stacked diff plus local correctness/benchmark evidence (+15–18% generation t/s, prefill neutral). I kept this PR unchanged.
Rebased this PR onto current upstream main (613e9b2 at benchmark/rebase time), force-pushed m5-responses, and updated the body with fresh paired benchmarks plus deterministic ds4-eval parity. Fresh result vs current main: 4096 prefill 284.42 → 291.56 t/s (+2.51%), 65k prefill effectively neutral/noisy, generation neutral. Eval slice is unchanged: 10/12 on both branches, same token counts and same two failures. The larger decode throughput win remains isolated in #169.